Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50853][CORE] Close temp shuffle file writable channel #49531

Conversation

ChenMichael
Copy link

What changes were proposed in this pull request?

Currently, there are two implementations of DownloadFileWritableChannel (which is used for writing data fetched to disk), SimpleDownloadWritableChannel and EncryptedDownloadWritableChannel. The latter closes the writable channel in it's implementation of closeAndRead method while the former does not. As a result, SimpleDownloadWritableChannel channel is never closed and is relying on either the finalizer in FileOutputStream or the phantom cleanable in FileDescriptor to close the file descriptor. The change in this PR is to close the channel in SimpleDownloadWritableChannel closeAndRead method.

Why are the changes needed?

Should be closing file handles when they are not needed anymore instead of relying on finalizer/cleanables to do it.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing spark tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Jan 16, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR. This sounds like a bug fix. Do you think if we can make a unit test for your claims, @ChenMichael ?

Should be closing file handles when they are not needed anymore instead of relying on finalizer/cleanables to do it.

@dongjoon-hyun
Copy link
Member

cc @mridulm

@LuciferYang
Copy link
Contributor

LuciferYang commented Jan 17, 2025

It seems that this code path does not have sufficient test coverage. I added a log statement in the closeAndRead method and then conducted tests on the network-shuffle and core modules locally. I found that:

  1. The tests in the network-shuffle module do not cover this code path
  2. There are a few test cases in the core module that indirectly cover this path (such as some test cases in DistributedSuite), but it is not easy to add new assertions.

So, could you add a targeted test case in this pull request? @ChenMichael Thanks ~

@LuciferYang LuciferYang changed the title [SPARK-50853][CORE] - Close temp shuffle file writable channel [SPARK-50853][CORE] Close temp shuffle file writable channel Jan 17, 2025
@ChenMichael
Copy link
Author

Yea, I was finding it difficult to write a unit test for this, so the way I tested was by launching spark shell, running some queries that would force shuffles and looking at the number of open file handles for the executors through lsof. I'll spend some time trying to come up with test cases for it.

@ChenMichael
Copy link
Author

Not sure if there's a better way to demonstrate the problem, but I added a test case that is kind of showing how the channel isn't closed through fetchBlocks flow at least. I had to make a bunch of things public though. Let me know if there's a better way to test this.

@@ -130,6 +132,62 @@ class NettyBlockTransferServiceSuite
assert(hitExecutorDeadException)
}

test("SPARK-50853 - example of simple download file writable channel not being closed") {
Copy link
Contributor

@LuciferYang LuciferYang Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think this test case is a bit heavy. According to the pr description, perhaps the new case just needs to ensure that after calling the closeAndRead() method on SimpleDownloadWritableChannel, the isOpen() status should return false (Before this PR, this assertion would fail.).

For example, maybe we could add a very simple case in the network-shuffle module:

@Test
    public void testIsOpenAfterCallCloseAndRead() throws IOException {
        File tempFile = File.createTempFile("test", ".tmp");
        tempFile.deleteOnExit();
        Map<String, String> confMap = new HashMap<>();
        // Perhaps some additional configuration options need to be put into the confMap.
        TransportConf shuffle = new TransportConf("shuffle", new MapConfigProvider(confMap));

        DownloadFile downloadFile = null;
        try {
            downloadFile = new SimpleDownloadFile(tempFile, shuffle);

            DownloadFileWritableChannel channel = downloadFile.openForWriting();

            // ...
            // ... Perhaps some other operations are needed.
            channel.closeAndRead();

            Assertions.assertFalse(channel.isOpen(), "Channel should be closed after closeAndRead.");
            // ... Perhaps some additional assertions need to be added.
        } finally {
            if (downloadFile != null) {
                downloadFile.delete();
            }
        }
    }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is much simpler. Done

@ChenMichael ChenMichael force-pushed the SPARK-50853-close-temp-shuffle-file-channel branch from a93d0f3 to 2e4667e Compare January 22, 2025 17:06
@LuciferYang
Copy link
Contributor

Thank you for updating. Could you make CI happy @ChenMichael

@ChenMichael
Copy link
Author

I looked at the failing tests and they aren't related to the changes here. They are also passing in the rerun

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @ChenMichael and @LuciferYang .

dongjoon-hyun pushed a commit that referenced this pull request Jan 24, 2025
### What changes were proposed in this pull request?

Currently, there are two implementations of DownloadFileWritableChannel (which is used for writing data fetched to disk), SimpleDownloadWritableChannel and EncryptedDownloadWritableChannel. The latter closes the writable channel in it's implementation of closeAndRead method while the former does not. As a result, SimpleDownloadWritableChannel channel is never closed and is relying on either the finalizer in FileOutputStream or the phantom cleanable in FileDescriptor to close the file descriptor. The change in this PR is to close the channel in SimpleDownloadWritableChannel closeAndRead method.

### Why are the changes needed?

Should be closing file handles when they are not needed anymore instead of relying on finalizer/cleanables to do it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing spark tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49531 from ChenMichael/SPARK-50853-close-temp-shuffle-file-channel.

Authored-by: Michael Chen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4e3b831)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/4.0.

Please make a backporting PR to branch-3.5, @ChenMichael .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants